ABSTRACT

Emergence of modern techniques for scientific data collection has resulted in large scale accumulation of data 
pertaining to diverse fields. Conventional database querying methods are inadequate to extract useful 
information from huge data banks. Cluster analysis is one of the major data analysis methods and the k-means 
clustering algorithm is widely used for many practical applications. But the original k-means algorithm is 
computationally expensive and the quality of the resulting clusters heavily depends on the selection of initial 
centroids. Several methods have been proposed in the literature for improving the performance of the k-means 
clustering algorithm. This paper proposes a method for making the algorithm more effective and efficient, so as
to obtain better clustering with reduced complexity.
I. INTRODUCTION

Advances in scientific data collection methods have resulted in the large scale accumulation of 
promising data pertaining to diverse fields of science and technology. Owing to the development of novel 
techniques for generating and collecting data, the rate of growth of scientific databases has become tremendous. 
Hence it is practically impossible to extract useful information from them by using conventional database 
analysis techniques. Effective mining methods are absolutely essential to unearth implicit information from 
huge databases. Cluster analysis [3] is one of the major data analysis methods which is widely used for many 
practical applications in emerging areas like Bioinformatics [1, 2]. Clustering is the process of partitioning a 
given set of objects into disjoint clusters. This is done in such a way that objects in the same cluster are similar 
while objects belonging to different clusters differ considerably, with respect to their attributes. The k-means 
algorithm [3, 4, 5, 6, 7] is effective in producing clusters for many practical applications. But the computational 
complexity of the original k-means algorithm is very high, especially for large data sets. Moreover, this
algorithm produces different clusters depending on the random choice of initial centroids. Several
attempts have been made by researchers to improve the accuracy and efficiency of the k-means algorithm.

II. BASIC K-MEANS CLUSTERING ALGORITHM 

The k-means clustering algorithm is a partition-based cluster analysis method [10]. According to the
algorithm, we first select k objects as initial cluster centers, then calculate the distance between each object and
each cluster center and assign the object to the nearest cluster, update the averages of all clusters, and repeat this
process until the criterion function converges. The square error criterion for clustering is

$$E = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - m_i)^2$$

where $x_{ij}$ is sample j of class i, $m_i$ is the center of class i, and $n_i$ is the number of samples of class i. The
algorithm process is shown in Fig 2.1. The k-means clustering algorithm is simply described as follows:

Input: N objects to be clustered (x_1, x_2, ..., x_N), the number of clusters k;

Output: k clusters such that the sum of dissimilarity between each object and its nearest cluster center is the smallest;

1) Arbitrarily select k objects as initial cluster centers (m_1, m_2, ..., m_k);

2) Calculate the distance between each object x_i and each cluster center, then assign each object to the nearest
cluster. The distance is calculated as

$$d(x_i, m_j) = \sqrt{\sum_{l=1}^{d} (x_{il} - m_{jl})^2}, \quad i = 1, \dots, N; \; j = 1, \dots, k$$

where $d(x_i, m_j)$ is the distance between data object i and cluster center j, and the sum runs over the d attributes of each object;

3) Calculate the mean of the objects in each cluster as the new cluster centers,

$$m_i = \frac{1}{N_i} \sum_{j=1}^{N_i} x_{ij}, \quad i = 1, 2, \dots, k$$

where $N_i$ is the number of samples of the current cluster i;

4) Repeat 2) and 3) until the criterion function E converges, then return (m_1, m_2, ..., m_k). The algorithm terminates.
Fig 2.1 K-means algorithm process
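The four steps above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function names, the optional `init` argument (added only so that a run can be reproduced), the tolerance `tol`, and the rule of keeping the old center when a cluster becomes empty are assumptions that do not come from the text.

```python
import math
import random


def euclidean(x, y):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_point(points):
    """Component-wise arithmetic mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def squared_error(points, labels, centers):
    """Criterion E: sum of squared distances of each point to its own center."""
    return sum(euclidean(p, centers[c]) ** 2 for p, c in zip(points, labels))


def basic_kmeans(points, k, init=None, max_iter=100, tol=1e-9, seed=0):
    """Basic k-means; `init`, `tol` and `seed` are illustrative assumptions."""
    rng = random.Random(seed)
    # Step 1: arbitrary initial centers (or the caller-supplied ones).
    centers = list(init) if init is not None else rng.sample(points, k)
    labels, error, prev_error = [], float("inf"), float("inf")
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest center.
        labels = [min(range(k), key=lambda j: euclidean(p, centers[j]))
                  for p in points]
        # Step 3: recompute each center as the mean of its members.
        for j in range(k):
            members = [p for p, c in zip(points, labels) if c == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = mean_point(members)
        # Step 4: stop once E no longer decreases by more than tol.
        error = squared_error(points, labels, centers)
        if prev_error - error <= tol:
            break
        prev_error = error
    return centers, labels, error
```

Calling `basic_kmeans(points, 2)` without `init` picks the starting centers at random, which is exactly the source of the run-to-run variation discussed in the introduction.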

III. RELATED WORK 

Several attempts were made by researchers to improve the effectiveness and efficiency of the k-means
algorithm [8]. A variant of the k-means algorithm is the k-modes method [9], which replaces the means of
clusters with modes. Like the k-means method, the k-modes algorithm also produces locally optimal solutions
which are dependent on the selection of the initial modes. The k-prototypes algorithm [9] integrates the k-means 
and k-modes processes for clustering the data. In this method, the dissimilarity measure is defined by taking into 
account both numeric and categorical attributes. The original k-means algorithm consists of two phases: one for 
determining the initial centroids and the other for assigning data points to the nearest clusters and then 
recalculating the cluster means. The second phase is carried out repetitively until the clusters get stabilized, i.e., 
data points stop crossing over cluster boundaries. Fang Yuan et al. [8] proposed a systematic method for finding 
the initial centroids. The centroids obtained by this method are consistent with the distribution of data. Hence it 
produced clusters with better accuracy, compared to the original k-means algorithm. However, Yuan's method 
does not suggest any improvement to the time complexity of the k-means algorithm. Fahim A M et al. proposed 
an efficient method for assigning data-points to clusters. The original k-means algorithm is computationally very 
expensive because each iteration computes the distances between data points and all the centroids. Fahim's 
approach makes use of two distance functions for this purpose: one similar to the k-means algorithm and
another based on a heuristic to reduce the number of distance calculations. But this method presumes that
the initial centroids are determined randomly, as in the case of the original k-means algorithm. Hence there is no 
guarantee for the accuracy of the final clusters. 



IV. PROPOSED ALGORITHM 

In the novel clustering method discussed in this paper, both the phases of the original k-means 
algorithm are modified to improve the accuracy and efficiency. 
Input:

D = {d1, d2, ..., dn} // set of n data items

k // number of desired clusters
Output:

A set of k clusters.
Steps:

Phase 1: Determine the initial centroids of the clusters
Input:

D = {d1, d2, ..., dn} // set of n data items

k // number of desired clusters
Output: A set of k initial centroids.
Steps:

1. Set m = 1;

2. Compute the distance between each data point and all other data points in the set D;

3. Find the closest pair of data points from the set D and form a data-point set Am (1 <= m <= k) which contains
these two data points. Delete these two data points from the set D;

4. Find the data point in D that is closest to the data-point set Am. Add it to Am and delete it from D;

5. Repeat step 4 until the number of data points in Am reaches 0.75*(n/k);

6. If m < k, then set m = m+1, find another pair of data points from D between which the distance is the
shortest, form another data-point set Am and delete them from D. Go to step 4;

7. For each data-point set Am (1 <= m <= k), find the arithmetic mean of the vectors of data points in Am.
These means will be the initial centroids.
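A minimal Python sketch of Phase 1 is given below, under stated assumptions. The function name `initial_centroids` and the parameter `fraction` (defaulting to the 0.75 factor in step 5) are hypothetical. For brevity the sketch recomputes distances on demand instead of storing the pairwise distances of step 2, and it stops early if fewer than two points remain, a case the steps above do not address.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def initial_centroids(data, k, fraction=0.75):
    """Build k point sets A1..Ak greedily and return their means."""
    remaining = list(data)
    threshold = fraction * len(data) / k  # step 5: 0.75 * (n / k)
    groups = []
    for _ in range(k):
        if len(remaining) < 2:
            break  # assumption: behaviour for this case is not specified
        # Steps 3 and 6: closest pair among the remaining points.
        i, j = min(
            ((a, b) for a in range(len(remaining))
             for b in range(a + 1, len(remaining))),
            key=lambda ab: euclidean(remaining[ab[0]], remaining[ab[1]]),
        )
        group = [remaining[i], remaining[j]]
        del remaining[j]  # delete the larger index first so i stays valid
        del remaining[i]
        # Steps 4 and 5: grow the set with the point closest to it.
        while len(group) < threshold and remaining:
            idx = min(
                range(len(remaining)),
                key=lambda r: min(euclidean(remaining[r], g) for g in group),
            )
            group.append(remaining.pop(idx))
        groups.append(group)
    # Step 7: the arithmetic mean of each set is an initial centroid.
    return [tuple(sum(c) / len(g) for c in zip(*g)) for g in groups]
```

The closest-pair search examines every pair of remaining points, so this phase is quadratic in the number of data points.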

Phase 2: Assign each data point to the appropriate clusters 

Input: 

D = {d1, d2, ..., dn} // set of n data-points

C = {c1, c2, ..., ck} // set of k centroids

Output: 

A set of k clusters 
Steps: 

1. Compute the distance of each data-point di (1 <= i <= n) to all the centroids cj (1 <= j <= k) as d(di, cj);

2. For each data-point di, find the closest centroid cj and assign di to cluster j;

3. Set ClusterId[i] = j; // j: Id of the closest cluster

4. Set Nearest_Dist[i] = d(di, cj);

5. For each cluster j (1 <= j <= k), recalculate the centroids;

6. Repeat

7. For each data-point di,

7.1 Compute its distance from the centroid of the present nearest cluster;

7.2 If this distance is less than or equal to the present nearest distance, the data-point stays in the
cluster;

Else

7.2.1 For every centroid cj (1 <= j <= k), compute the distance d(di, cj);

Endfor;

7.2.2 Assign the data-point di to the cluster with
the nearest centroid cj;

7.2.3 Set ClusterId[i] = j;

7.2.4 Set Nearest_Dist[i] = d(di, cj);

Endfor;

8. For each cluster j (1 <= j <= k), recalculate the centroids;

Until the convergence criterion is met.
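The Phase 2 steps can be sketched in Python as shown below. This is an illustration under assumptions, not the authors' code: the names `assign_with_heuristic` and `max_iter` are hypothetical, convergence is taken to mean that no data-point changed cluster during a pass, an empty cluster keeps its previous centroid, and, following the steps literally, Nearest_Dist is left unchanged when a point stays in its cluster.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_point(points):
    """Component-wise arithmetic mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def assign_with_heuristic(data, centroids, max_iter=100):
    """Assign points to clusters, skipping full distance scans when possible."""
    k = len(centroids)
    centroids = [tuple(c) for c in centroids]
    cluster_id, nearest_dist = [], []
    # Steps 1-4: initial assignment against every centroid.
    for p in data:
        dists = [euclidean(p, c) for c in centroids]
        j = min(range(k), key=dists.__getitem__)
        cluster_id.append(j)
        nearest_dist.append(dists[j])

    def recalculate():
        for j in range(k):
            members = [p for p, c in zip(data, cluster_id) if c == j]
            if members:  # assumption: an empty cluster keeps its centroid
                centroids[j] = mean_point(members)

    recalculate()  # step 5
    for _ in range(max_iter):  # step 6
        moved = False
        for i, p in enumerate(data):  # step 7
            d = euclidean(p, centroids[cluster_id[i]])  # step 7.1
            if d <= nearest_dist[i]:  # step 7.2: the point stays
                continue
            dists = [euclidean(p, c) for c in centroids]  # step 7.2.1
            j = min(range(k), key=dists.__getitem__)  # step 7.2.2
            moved = moved or j != cluster_id[i]
            cluster_id[i] = j  # step 7.2.3
            nearest_dist[i] = dists[j]  # step 7.2.4
        recalculate()  # step 8
        if not moved:
            break
    return cluster_id, centroids
```

For example, starting from the deliberately poor centroids `[(0, 0), (1, 0)]` on two well-separated groups of three points, the point `(1, 0)` fails the test in step 7.2 during the first pass and is moved to the other cluster, after which no point moves again.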

In the first phase, the initial centroids are determined systematically so as to produce clusters with
better accuracy [8]. The second phase makes use of a variant of the clustering method of Fahim et al. described
in the related work. It starts by forming the initial clusters based on the relative distance of each data-point from
the initial centroids. These clusters are subsequently fine-tuned by using a heuristic approach, thereby improving
the efficiency. In the first phase, initially compute the distances between each data point and all other data points in D.
Then find the closest pair of data points and form a set A1 consisting of these two data points, and delete
them from the data point set D. Then determine the data point which is closest to the set A1, add it to A1 and
delete it from D. Repeat this procedure until the number of elements in the set A1 reaches a threshold. At that
point go back to the second step and form another data-point set A2. Repeat this till k such sets of data points
are obtained. Finally the initial centroids are obtained by averaging all the vectors in each data-point set. The
Euclidean distance is used for determining the closeness of each data point to the cluster centroids. The distance
between one vector X = (x1, x2, ..., xn) and another vector Y = (y1, y2, ..., yn) is obtained as

$$d(X, Y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_n - y_n)^2}$$

The distance between a data point X and a data-point set D is defined as

$$d(X, D) = \min \{\, d(X, Y) : Y \in D \,\}$$

The initial centroids of the clusters are given as input to the
second phase, for assigning data-points to appropriate clusters.
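The two distance definitions used here translate directly into Python. The function names below are illustrative and do not appear in the text.

```python
import math


def euclidean(x, y):
    """d(X, Y): Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def distance_to_set(x, points):
    """d(X, D): distance from x to the closest member of a non-empty set."""
    return min(euclidean(x, y) for y in points)


print(euclidean((0, 0), (3, 4)))                    # 5.0
print(distance_to_set((0, 0), [(3, 4), (6, 8)]))    # 5.0
```

In the second call the nearer member of the set, `(3, 4)`, determines the result, which is how step 4 of Phase 1 measures the closeness of a candidate point to the growing set Am.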
The first step in Phase 2 is to determine the distance between each data-point and the initial centroids of
all the clusters. The data-points are then assigned to the clusters having the closest centroids. This results in an 
initial grouping of the data-points. For each data-point, the cluster to which it is assigned (ClusterId) and its
distance from the centroid of the nearest cluster (Nearest_Dist) are noted. Inclusion of data-points in various
clusters may lead to a change in the values of the cluster centroids. For each cluster, the centroids are 
recalculated by taking the mean of the values of its data-points. Up to this step, the procedure is almost similar 
to the original k-means algorithm except that the initial centroids are computed systematically. The next stage is 
an iterative process which makes use of a heuristic method to improve the efficiency. During the iteration, the 
data-points may get redistributed to different clusters. The method involves keeping track of the distance 
between each data-point and the centroid of its present nearest cluster. At the beginning of the iteration, the 
distance of each data-point from the new centroid of its present nearest cluster is determined. If this distance is 
less than or equal to the previous nearest distance, that is an indication that the data point stays in that cluster 
itself and there is no need to compute its distance from other centroids. This results in saving the time required
to compute the distances to k-1 cluster centroids. On the other hand, if the new centroid of the present nearest
cluster is more distant from the data-point than its previous centroid, there is a chance for the data-point getting 
included in another nearer cluster. In that case, it is required to determine the distance of the data-point from all 
the cluster centroids. Identify the new nearest cluster and record the new value of the nearest distance. The loop 
is repeated until no more data-points cross cluster boundaries, which indicates the convergence criterion. The 
heuristic method described above results in significant reduction in the number of computations and thus 
improves the efficiency. 
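The saving can be made concrete by counting distance evaluations per iteration. The sketch below is a back-of-the-envelope model, not a measurement: the function names are hypothetical, and `rechecked` (the number of data-points that fail the stay test in a given pass) is an input that depends on the data and is simply assumed here.

```python
def full_pass_cost(n, k):
    """Original k-means: every point is compared with every centroid."""
    return n * k


def heuristic_pass_cost(n, k, rechecked):
    """One evaluation per point for the stay test, plus k evaluations
    for each point that fails the test and is compared with all centroids."""
    if not 0 <= rechecked <= n:
        raise ValueError("rechecked must lie between 0 and n")
    return n + rechecked * k


# Hypothetical figures: 1000 points, 10 clusters, 50 points rechecked.
print(full_pass_cost(1000, 10))           # 10000
print(heuristic_pass_cost(1000, 10, 50))  # 1500
```

Under this model the heuristic pays off only when most points pass the stay test; if every point fails it, the pass costs n + n*k evaluations, slightly more than the original.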

V. CONCLUSION 

The k-means algorithm is widely used for clustering large sets of data. But the standard algorithm does
not always guarantee good results, as the accuracy of the final clusters depends on the selection of initial
centroids. Moreover, the computational complexity of the standard algorithm is objectionably high owing to the
need to reassign the data points a number of times, during every iteration of the loop. This paper presents an 
enhanced k-means algorithm which combines a systematic method for finding initial centroids and an efficient 
way for assigning data points to clusters. This method completes the entire clustering process in O(n^2) time
without sacrificing the accuracy of clusters. The previous improvements of the k-means algorithm compromise
on either accuracy or efficiency. A limitation of the proposed algorithm is that the value of k, the number of
desired clusters, is still required to be given as an input, regardless of the distribution of the data points. 
Evolving some statistical methods to compute the value of k, depending on the data distribution, is suggested for 
future research. A method for refining the computation of initial centroids is worth investigating. 